{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Linking a dataset of real historical persons with Deterrministic Rules\n",
        "\n",
        "While Splink is primarily a tool for probabilistic records linkage, it includes functionality to perform deterministic (i.e. rules based) linkage.\n",
        "\n",
        "Significant work has gone into optimising the performance of rules based matching, so Splink is likely to be significantly faster than writing the basic SQL by hand.\n",
        "\n",
        "In this example, we deduplicate a 50k row dataset based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors introduced. The probabilistic dedupe of the same dataset can be found at `Deduplicate 50k rows historical persons`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/examples/duckdb/deterministic_dedupe.ipynb\">\n",
        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
        "</a>\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:10:59.567669Z",
          "iopub.status.busy": "2024-06-07T09:10:59.567311Z",
          "iopub.status.idle": "2024-06-07T09:10:59.591784Z",
          "shell.execute_reply": "2024-06-07T09:10:59.590923Z"
        }
      },
      "outputs": [],
      "source": [
        "# Uncomment and run this cell if you're running in Google Colab.\n",
        "# !pip install splink"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:10:59.595969Z",
          "iopub.status.busy": "2024-06-07T09:10:59.595667Z",
          "iopub.status.idle": "2024-06-07T09:11:01.007136Z",
          "shell.execute_reply": "2024-06-07T09:11:01.006553Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>unique_id</th>\n",
              "      <th>cluster</th>\n",
              "      <th>full_name</th>\n",
              "      <th>first_and_surname</th>\n",
              "      <th>first_name</th>\n",
              "      <th>surname</th>\n",
              "      <th>dob</th>\n",
              "      <th>birth_place</th>\n",
              "      <th>postcode_fake</th>\n",
              "      <th>gender</th>\n",
              "      <th>occupation</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Q2296770-1</td>\n",
              "      <td>Q2296770</td>\n",
              "      <td>thomas clifford, 1st baron clifford of chudleigh</td>\n",
              "      <td>thomas chudleigh</td>\n",
              "      <td>thomas</td>\n",
              "      <td>chudleigh</td>\n",
              "      <td>1630-08-01</td>\n",
              "      <td>devon</td>\n",
              "      <td>tq13 8df</td>\n",
              "      <td>male</td>\n",
              "      <td>politician</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Q2296770-2</td>\n",
              "      <td>Q2296770</td>\n",
              "      <td>thomas of chudleigh</td>\n",
              "      <td>thomas chudleigh</td>\n",
              "      <td>thomas</td>\n",
              "      <td>chudleigh</td>\n",
              "      <td>1630-08-01</td>\n",
              "      <td>devon</td>\n",
              "      <td>tq13 8df</td>\n",
              "      <td>male</td>\n",
              "      <td>politician</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Q2296770-3</td>\n",
              "      <td>Q2296770</td>\n",
              "      <td>tom 1st baron clifford of chudleigh</td>\n",
              "      <td>tom chudleigh</td>\n",
              "      <td>tom</td>\n",
              "      <td>chudleigh</td>\n",
              "      <td>1630-08-01</td>\n",
              "      <td>devon</td>\n",
              "      <td>tq13 8df</td>\n",
              "      <td>male</td>\n",
              "      <td>politician</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Q2296770-4</td>\n",
              "      <td>Q2296770</td>\n",
              "      <td>thomas 1st chudleigh</td>\n",
              "      <td>thomas chudleigh</td>\n",
              "      <td>thomas</td>\n",
              "      <td>chudleigh</td>\n",
              "      <td>1630-08-01</td>\n",
              "      <td>devon</td>\n",
              "      <td>tq13 8hu</td>\n",
              "      <td>None</td>\n",
              "      <td>politician</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Q2296770-5</td>\n",
              "      <td>Q2296770</td>\n",
              "      <td>thomas clifford, 1st baron chudleigh</td>\n",
              "      <td>thomas chudleigh</td>\n",
              "      <td>thomas</td>\n",
              "      <td>chudleigh</td>\n",
              "      <td>1630-08-01</td>\n",
              "      <td>devon</td>\n",
              "      <td>tq13 8df</td>\n",
              "      <td>None</td>\n",
              "      <td>politician</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    unique_id   cluster                                         full_name  \\\n",
              "0  Q2296770-1  Q2296770  thomas clifford, 1st baron clifford of chudleigh   \n",
              "1  Q2296770-2  Q2296770                               thomas of chudleigh   \n",
              "2  Q2296770-3  Q2296770               tom 1st baron clifford of chudleigh   \n",
              "3  Q2296770-4  Q2296770                              thomas 1st chudleigh   \n",
              "4  Q2296770-5  Q2296770              thomas clifford, 1st baron chudleigh   \n",
              "\n",
              "  first_and_surname first_name    surname         dob birth_place  \\\n",
              "0  thomas chudleigh     thomas  chudleigh  1630-08-01       devon   \n",
              "1  thomas chudleigh     thomas  chudleigh  1630-08-01       devon   \n",
              "2     tom chudleigh        tom  chudleigh  1630-08-01       devon   \n",
              "3  thomas chudleigh     thomas  chudleigh  1630-08-01       devon   \n",
              "4  thomas chudleigh     thomas  chudleigh  1630-08-01       devon   \n",
              "\n",
              "  postcode_fake gender  occupation  \n",
              "0      tq13 8df   male  politician  \n",
              "1      tq13 8df   male  politician  \n",
              "2      tq13 8df   male  politician  \n",
              "3      tq13 8hu   None  politician  \n",
              "4      tq13 8df   None  politician  "
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import pandas as pd\n",
        "\n",
        "from splink import splink_datasets\n",
        "\n",
        "pd.options.display.max_rows = 1000\n",
        "df = splink_datasets.historical_50k\n",
        "df.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When defining the settings object, specity your deterministic rules in the `blocking_rules_to_generate_predictions` key.\n",
        "\n",
        "For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define `comparisons` nor any other parameters required for model training in a probabilistic model.\n"
      ]
    },
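    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of what a single deterministic rule means, the sketch below mimics the logic of `block_on(\"first_name\", \"surname\", \"dob\")` with a plain pandas self-join. The `toy` DataFrame and its values are made up for this example, and this is not how Splink executes the rule internally; it only shows the exact-match semantics.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# Toy records, invented for illustration (not rows from the real dataset).\n",
        "toy = pd.DataFrame(\n",
        "    {\n",
        "        'unique_id': [1, 2, 3, 4],\n",
        "        'first_name': ['thomas', 'thomas', 'tom', None],\n",
        "        'surname': ['chudleigh', 'chudleigh', 'chudleigh', 'chudleigh'],\n",
        "        'dob': ['1630-08-01', '1630-08-01', '1630-08-01', '1630-08-01'],\n",
        "    }\n",
        ")\n",
        "\n",
        "# Columns that must all agree exactly for two records to be linked.\n",
        "keys = ['first_name', 'surname', 'dob']\n",
        "\n",
        "# SQL equality never matches NULLs, but pandas merge pairs missing keys,\n",
        "# so rows with a missing key are dropped before joining.\n",
        "complete = toy.dropna(subset=keys)\n",
        "pairs = complete.merge(complete, on=keys, suffixes=('_l', '_r'))\n",
        "\n",
        "# Keep each unordered pair once and discard self-pairs.\n",
        "pairs = pairs[pairs['unique_id_l'] < pairs['unique_id_r']]\n",
        "print(pairs[['unique_id_l', 'unique_id_r']])\n",
        "```\n",
        "\n",
        "Only records 1 and 2 agree on all three columns, so they form the single linked pair. Record 3 differs on `first_name`, and record 4 has a missing `first_name` and so cannot satisfy the rule.\n"
      ]
    },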
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:01.050336Z",
          "iopub.status.busy": "2024-06-07T09:11:01.049679Z",
          "iopub.status.idle": "2024-06-07T09:11:01.602823Z",
          "shell.execute_reply": "2024-06-07T09:11:01.601902Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "<style>\n",
              "  #altair-viz-7573355cdaeb4eda997ecaf67fa26c1e.vega-embed {\n",
              "    width: 100%;\n",
              "    display: flex;\n",
              "  }\n",
              "\n",
              "  #altair-viz-7573355cdaeb4eda997ecaf67fa26c1e.vega-embed details,\n",
              "  #altair-viz-7573355cdaeb4eda997ecaf67fa26c1e.vega-embed details summary {\n",
              "    position: relative;\n",
              "  }\n",
              "</style>\n",
              "<div id=\"altair-viz-7573355cdaeb4eda997ecaf67fa26c1e\"></div>\n",
              "<script type=\"text/javascript\">\n",
              "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
              "  (function(spec, embedOpt){\n",
              "    let outputDiv = document.currentScript.previousElementSibling;\n",
              "    if (outputDiv.id !== \"altair-viz-7573355cdaeb4eda997ecaf67fa26c1e\") {\n",
              "      outputDiv = document.getElementById(\"altair-viz-7573355cdaeb4eda997ecaf67fa26c1e\");\n",
              "    }\n",
              "    const paths = {\n",
              "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
              "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
              "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
              "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
              "    };\n",
              "\n",
              "    function maybeLoadScript(lib, version) {\n",
              "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
              "      return (VEGA_DEBUG[key] == version) ?\n",
              "        Promise.resolve(paths[lib]) :\n",
              "        new Promise(function(resolve, reject) {\n",
              "          var s = document.createElement('script');\n",
              "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
              "          s.async = true;\n",
              "          s.onload = () => {\n",
              "            VEGA_DEBUG[key] = version;\n",
              "            return resolve(paths[lib]);\n",
              "          };\n",
              "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
              "          s.src = paths[lib];\n",
              "        });\n",
              "    }\n",
              "\n",
              "    function showError(err) {\n",
              "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
              "      throw err;\n",
              "    }\n",
              "\n",
              "    function displayChart(vegaEmbed) {\n",
              "      vegaEmbed(outputDiv, spec, embedOpt)\n",
              "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
              "    }\n",
              "\n",
              "    if(typeof define === \"function\" && define.amd) {\n",
              "      requirejs.config({paths});\n",
              "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
              "    } else {\n",
              "      maybeLoadScript(\"vega\", \"5\")\n",
              "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
              "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
              "        .catch(showError)\n",
              "        .then(() => displayChart(vegaEmbed));\n",
              "    }\n",
              "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"data\": {\"name\": \"data-b9acee0cea6dec7a8f0f5f5c44419bdf\"}, \"mark\": \"bar\", \"encoding\": {\"order\": {\"field\": \"cumulative_rows\"}, \"tooltip\": [{\"field\": \"blocking_rule\", \"title\": \"SQL Condition\", \"type\": \"nominal\"}, {\"field\": \"row_count\", \"format\": \",\", \"title\": \"Comparisons Generated\", \"type\": \"quantitative\"}, {\"field\": \"cumulative_rows\", \"format\": \",\", \"title\": \"Cumulative Comparisons\", \"type\": \"quantitative\"}, {\"field\": \"cartesian\", \"format\": \",\", \"title\": \"Total comparisons in Cartesian product\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"start\", \"title\": \"Comparisons Generated by Rule(s)\", \"type\": \"quantitative\"}, \"x2\": {\"field\": \"cumulative_rows\"}, \"y\": {\"field\": \"blocking_rule\", \"sort\": [\"-x2\"], \"title\": \"SQL Blocking Rule\"}}, \"height\": {\"step\": 20}, \"title\": {\"text\": \"Count of Additional Comparisons Generated by Each Blocking Rule\", \"subtitle\": \"(Counts exclude comparisons already generated by previous rules)\"}, \"width\": 450, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\", \"datasets\": {\"data-b9acee0cea6dec7a8f0f5f5c44419bdf\": [{\"blocking_rule\": \"(l.\\\"first_name\\\" = r.\\\"first_name\\\") AND (l.\\\"surname\\\" = r.\\\"surname\\\") AND (l.\\\"dob\\\" = r.\\\"dob\\\")\", \"row_count\": 37852, \"cumulative_rows\": 37852, \"cartesian\": 1279041753, \"match_key\": \"0\", \"start\": 0}, {\"blocking_rule\": \"(l.\\\"surname\\\" = r.\\\"surname\\\") AND (l.\\\"dob\\\" = r.\\\"dob\\\") AND (l.\\\"postcode_fake\\\" = r.\\\"postcode_fake\\\")\", \"row_count\": 9568, \"cumulative_rows\": 47420, \"cartesian\": 1279041753, \"match_key\": \"1\", \"start\": 37852}, {\"blocking_rule\": \"(l.\\\"first_name\\\" = r.\\\"first_name\\\") AND (l.\\\"dob\\\" = r.\\\"dob\\\") AND (l.\\\"occupation\\\" = r.\\\"occupation\\\")\", 
\"row_count\": 4306, \"cumulative_rows\": 51726, \"cartesian\": 1279041753, \"match_key\": \"2\", \"start\": 47420}]}}, {\"mode\": \"vega-lite\"});\n",
              "</script>"
            ],
            "text/plain": [
              "alt.Chart(...)"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from splink import DuckDBAPI, block_on\n",
        "from splink.blocking_analysis import (\n",
        "    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n",
        ")\n",
        "\n",
        "db_api = DuckDBAPI()\n",
        "cumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n",
        "    table_or_tables=df,\n",
        "    blocking_rules=[\n",
        "        block_on(\"first_name\", \"surname\", \"dob\"),\n",
        "        block_on(\"surname\", \"dob\", \"postcode_fake\"),\n",
        "        block_on(\"first_name\", \"dob\", \"occupation\"),\n",
        "    ],\n",
        "    db_api=db_api,\n",
        "    link_type=\"dedupe_only\",\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:01.606853Z",
          "iopub.status.busy": "2024-06-07T09:11:01.606539Z",
          "iopub.status.idle": "2024-06-07T09:11:01.691839Z",
          "shell.execute_reply": "2024-06-07T09:11:01.690988Z"
        }
      },
      "outputs": [],
      "source": [
        "from splink import Linker, SettingsCreator\n",
        "\n",
        "settings = SettingsCreator(\n",
        "    link_type=\"dedupe_only\",\n",
        "    blocking_rules_to_generate_predictions=[\n",
        "        block_on(\"first_name\", \"surname\", \"dob\"),\n",
        "        block_on(\"surname\", \"dob\", \"postcode_fake\"),\n",
        "        block_on(\"first_name\", \"dob\", \"occupation\"),\n",
        "    ],\n",
        "    retain_intermediate_calculation_columns=True,\n",
        ")\n",
        "\n",
        "linker = Linker(df, settings, db_api=db_api)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The results of the linkage can be viewed with the `deterministic_link` function.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:01.695906Z",
          "iopub.status.busy": "2024-06-07T09:11:01.695600Z",
          "iopub.status.idle": "2024-06-07T09:11:01.995020Z",
          "shell.execute_reply": "2024-06-07T09:11:01.994289Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>unique_id_l</th>\n",
              "      <th>unique_id_r</th>\n",
              "      <th>occupation_l</th>\n",
              "      <th>occupation_r</th>\n",
              "      <th>first_name_l</th>\n",
              "      <th>first_name_r</th>\n",
              "      <th>dob_l</th>\n",
              "      <th>dob_r</th>\n",
              "      <th>surname_l</th>\n",
              "      <th>surname_r</th>\n",
              "      <th>postcode_fake_l</th>\n",
              "      <th>postcode_fake_r</th>\n",
              "      <th>match_key</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Q55455287-12</td>\n",
              "      <td>Q55455287-2</td>\n",
              "      <td>None</td>\n",
              "      <td>writer</td>\n",
              "      <td>jaido</td>\n",
              "      <td>jaido</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>morata</td>\n",
              "      <td>morata</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>ta4 2uu</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Q55455287-12</td>\n",
              "      <td>Q55455287-3</td>\n",
              "      <td>None</td>\n",
              "      <td>writer</td>\n",
              "      <td>jaido</td>\n",
              "      <td>jaido</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>morata</td>\n",
              "      <td>morata</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>ta4 2uu</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Q55455287-12</td>\n",
              "      <td>Q55455287-4</td>\n",
              "      <td>None</td>\n",
              "      <td>writer</td>\n",
              "      <td>jaido</td>\n",
              "      <td>jaido</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>morata</td>\n",
              "      <td>morata</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>ta4 2sz</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Q55455287-12</td>\n",
              "      <td>Q55455287-5</td>\n",
              "      <td>None</td>\n",
              "      <td>None</td>\n",
              "      <td>jaido</td>\n",
              "      <td>jaido</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>morata</td>\n",
              "      <td>morata</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Q55455287-12</td>\n",
              "      <td>Q55455287-6</td>\n",
              "      <td>None</td>\n",
              "      <td>writer</td>\n",
              "      <td>jaido</td>\n",
              "      <td>jaido</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>1836-01-01</td>\n",
              "      <td>morata</td>\n",
              "      <td>morata</td>\n",
              "      <td>ta4 2ug</td>\n",
              "      <td>None</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    unique_id_l  unique_id_r occupation_l occupation_r first_name_l  \\\n",
              "0  Q55455287-12  Q55455287-2         None       writer        jaido   \n",
              "1  Q55455287-12  Q55455287-3         None       writer        jaido   \n",
              "2  Q55455287-12  Q55455287-4         None       writer        jaido   \n",
              "3  Q55455287-12  Q55455287-5         None         None        jaido   \n",
              "4  Q55455287-12  Q55455287-6         None       writer        jaido   \n",
              "\n",
              "  first_name_r       dob_l       dob_r surname_l surname_r postcode_fake_l  \\\n",
              "0        jaido  1836-01-01  1836-01-01    morata    morata         ta4 2ug   \n",
              "1        jaido  1836-01-01  1836-01-01    morata    morata         ta4 2ug   \n",
              "2        jaido  1836-01-01  1836-01-01    morata    morata         ta4 2ug   \n",
              "3        jaido  1836-01-01  1836-01-01    morata    morata         ta4 2ug   \n",
              "4        jaido  1836-01-01  1836-01-01    morata    morata         ta4 2ug   \n",
              "\n",
              "  postcode_fake_r match_key  \n",
              "0         ta4 2uu         0  \n",
              "1         ta4 2uu         0  \n",
              "2         ta4 2sz         0  \n",
              "3         ta4 2ug         0  \n",
              "4            None         0  "
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df_predict = linker.inference.deterministic_link()\n",
        "df_predict.as_pandas_dataframe().head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Which can be used to generate clusters.\n",
        "\n",
        "Note, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set `threshold_match_probability=1` in the `cluster_pairwise_predictions_at_threshold` function.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:01.998965Z",
          "iopub.status.busy": "2024-06-07T09:11:01.998665Z",
          "iopub.status.idle": "2024-06-07T09:11:02.348788Z",
          "shell.execute_reply": "2024-06-07T09:11:02.348039Z"
        }
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Completed iteration 1, root rows count 94\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Completed iteration 2, root rows count 10\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Completed iteration 3, root rows count 0\n"
          ]
        }
      ],
      "source": [
        "clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n",
        "    df_predict\n",
        ")"
      ]
    },
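    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Clustering groups together records that are connected, directly or indirectly, by a link. The sketch below shows that idea with a small union-find in plain Python. The `connected_components` helper and the record ids are hypothetical, and this is not Splink's implementation; it only illustrates how pairwise links turn into clusters.\n",
        "\n",
        "```python\n",
        "def connected_components(pairs):\n",
        "    # Map every record id to a representative id for its cluster.\n",
        "    parent = {}\n",
        "\n",
        "    def find(x):\n",
        "        # Follow parent pointers to the root, shortening the path as we go.\n",
        "        parent.setdefault(x, x)\n",
        "        while parent[x] != x:\n",
        "            parent[x] = parent[parent[x]]\n",
        "            x = parent[x]\n",
        "        return x\n",
        "\n",
        "    for left, right in pairs:\n",
        "        root_l, root_r = find(left), find(right)\n",
        "        if root_l != root_r:\n",
        "            # Merge the two groups; the smaller id becomes the representative.\n",
        "            parent[max(root_l, root_r)] = min(root_l, root_r)\n",
        "\n",
        "    return {node: find(node) for node in list(parent)}\n",
        "\n",
        "\n",
        "# Made-up links: a-1 and a-3 are never compared directly,\n",
        "# but both are linked to a-2, so all three share a cluster.\n",
        "links = [('a-1', 'a-2'), ('a-2', 'a-3'), ('b-1', 'b-2')]\n",
        "print(connected_components(links))\n",
        "```\n",
        "\n",
        "Because links are transitive under this approach, one overly loose deterministic rule can merge unrelated records into a single large cluster, which is one reason to inspect the clusters afterwards.\n"
      ]
    },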
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:02.352872Z",
          "iopub.status.busy": "2024-06-07T09:11:02.352366Z",
          "iopub.status.idle": "2024-06-07T09:11:02.367858Z",
          "shell.execute_reply": "2024-06-07T09:11:02.367179Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cluster_id</th>\n",
              "      <th>unique_id</th>\n",
              "      <th>cluster</th>\n",
              "      <th>full_name</th>\n",
              "      <th>first_and_surname</th>\n",
              "      <th>first_name</th>\n",
              "      <th>surname</th>\n",
              "      <th>dob</th>\n",
              "      <th>birth_place</th>\n",
              "      <th>postcode_fake</th>\n",
              "      <th>gender</th>\n",
              "      <th>occupation</th>\n",
              "      <th>__splink_salt</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Q16025107-1</td>\n",
              "      <td>Q5497940-9</td>\n",
              "      <td>Q5497940</td>\n",
              "      <td>frederick hall</td>\n",
              "      <td>frederick hall</td>\n",
              "      <td>frederick</td>\n",
              "      <td>hall</td>\n",
              "      <td>1855-01-01</td>\n",
              "      <td>bristol, city of</td>\n",
              "      <td>bs11 9pn</td>\n",
              "      <td>None</td>\n",
              "      <td>None</td>\n",
              "      <td>0.002739</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Q1149445-1</td>\n",
              "      <td>Q1149445-9</td>\n",
              "      <td>Q1149445</td>\n",
              "      <td>earl egerton</td>\n",
              "      <td>earl egerton</td>\n",
              "      <td>earl</td>\n",
              "      <td>egerton</td>\n",
              "      <td>1800-01-01</td>\n",
              "      <td>westminster</td>\n",
              "      <td>w1d 2hf</td>\n",
              "      <td>None</td>\n",
              "      <td>None</td>\n",
              "      <td>0.991459</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Q20664532-1</td>\n",
              "      <td>Q21466387-2</td>\n",
              "      <td>Q21466387</td>\n",
              "      <td>harry brooker</td>\n",
              "      <td>harry brooker</td>\n",
              "      <td>harry</td>\n",
              "      <td>brooker</td>\n",
              "      <td>1848-01-01</td>\n",
              "      <td>plymouth</td>\n",
              "      <td>pl4 9hx</td>\n",
              "      <td>male</td>\n",
              "      <td>painter</td>\n",
              "      <td>0.506127</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Q1124636-1</td>\n",
              "      <td>Q1124636-12</td>\n",
              "      <td>Q1124636</td>\n",
              "      <td>tom stapleton</td>\n",
              "      <td>tom stapleton</td>\n",
              "      <td>tom</td>\n",
              "      <td>stapleton</td>\n",
              "      <td>1535-01-01</td>\n",
              "      <td>None</td>\n",
              "      <td>bn6 9na</td>\n",
              "      <td>male</td>\n",
              "      <td>theologian</td>\n",
              "      <td>0.612694</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Q18508292-1</td>\n",
              "      <td>Q21466711-4</td>\n",
              "      <td>Q21466711</td>\n",
              "      <td>harry s0ence</td>\n",
              "      <td>harry s0ence</td>\n",
              "      <td>harry</td>\n",
              "      <td>s0ence</td>\n",
              "      <td>1860-01-01</td>\n",
              "      <td>london</td>\n",
              "      <td>se1 7pb</td>\n",
              "      <td>male</td>\n",
              "      <td>painter</td>\n",
              "      <td>0.488917</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    cluster_id    unique_id    cluster       full_name first_and_surname  \\\n",
              "0  Q16025107-1   Q5497940-9   Q5497940  frederick hall    frederick hall   \n",
              "1   Q1149445-1   Q1149445-9   Q1149445    earl egerton      earl egerton   \n",
              "2  Q20664532-1  Q21466387-2  Q21466387   harry brooker     harry brooker   \n",
              "3   Q1124636-1  Q1124636-12   Q1124636   tom stapleton     tom stapleton   \n",
              "4  Q18508292-1  Q21466711-4  Q21466711    harry s0ence      harry s0ence   \n",
              "\n",
              "  first_name    surname         dob       birth_place postcode_fake gender  \\\n",
              "0  frederick       hall  1855-01-01  bristol, city of      bs11 9pn   None   \n",
              "1       earl    egerton  1800-01-01       westminster       w1d 2hf   None   \n",
              "2      harry    brooker  1848-01-01          plymouth       pl4 9hx   male   \n",
              "3        tom  stapleton  1535-01-01              None       bn6 9na   male   \n",
              "4      harry     s0ence  1860-01-01            london       se1 7pb   male   \n",
              "\n",
              "   occupation  __splink_salt  \n",
              "0        None       0.002739  \n",
              "1        None       0.991459  \n",
              "2     painter       0.506127  \n",
              "3  theologian       0.612694  \n",
              "4     painter       0.488917  "
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "clusters.as_pandas_dataframe(limit=5)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "These results can then be passed into the `Cluster Studio Dashboard`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-06-07T09:11:02.371850Z",
          "iopub.status.busy": "2024-06-07T09:11:02.371545Z",
          "iopub.status.idle": "2024-06-07T09:11:02.462645Z",
          "shell.execute_reply": "2024-06-07T09:11:02.461886Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "        <iframe\n",
              "            width=\"100%\"\n",
              "            height=\"1200\"\n",
              "            src=\"./dashboards/50k_deterministic_cluster.html\"\n",
              "            frameborder=\"0\"\n",
              "            allowfullscreen\n",
              "            \n",
              "        ></iframe>\n",
              "        "
            ],
            "text/plain": [
              "<IPython.lib.display.IFrame at 0x12edeee60>"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "linker.visualisations.cluster_studio_dashboard(\n",
        "    df_predict,\n",
        "    clusters,\n",
        "    \"dashboards/50k_deterministic_cluster.html\",\n",
        "    sampling_method=\"by_cluster_size\",\n",
        "    overwrite=True,\n",
        ")\n",
        "\n",
        "from IPython.display import IFrame\n",
        "\n",
        "IFrame(src=\"./dashboards/50k_deterministic_cluster.html\", width=\"100%\", height=1200)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}
